Feature Extraction using an Unsupervised Neural Network


Nathan  Intrator 

Div.  of  Applied  Mathematics,  and  Center  for  Neural  Science 
Brown  University 
Providence,  RI  02912 


Abstract 

A  novel  unsupervised  neural  network  for  di¬ 
mensionality  reduction  which  seeks  directions 
emphasizing  distinguishing  features  in  the 
data  is  presented.  A  statistical  framework 
for  the  parameter  estimation  problem  asso¬ 
ciated  with  this  neural  network  is  given  and 
its  connection  to  exploratory  projection  pur¬ 
suit  methods  is  established.  The  network  is 
shown  to  minimize  a  loss  function  (projec¬ 
tion  index)  over  a  set  of  parameters,  yielding 
an  optimal  decision  rule  under  some  norm. 

A  specific  projection  index  that  favors  direc¬ 
tions  possessing  multimodality  is  presented. 

This  leads  to  a  similar  form  to  the  synap¬ 
tic  modification  equations  governing  learning 
in  Bienenstock,  Cooper,  and  Munro  (BCM) 
neurons  (1982). 

The importance of a dimensionality reduction
principle based solely on distinguishing
features is demonstrated using a linguistically
motivated phoneme recognition experiment,
and compared with feature extraction
using principal components and a
back-propagation network.

1  How to construct optimal
unsupervised feature extraction

When  a  classification  of  high  dimensional  vectors  is 
sought, the curse of dimensionality (Bellman, 1961)
becomes  the  main  factor  affecting  the  classification 
performance.  The  curse  of  dimensionality  problem  is 
due  to  the  inherent  sparsity  of  high  dimensional  spaces, 
implying  that  the  amount  of  training  data  needed  to 
get reasonably low variance estimators is prohibitively
high. One approach to the problem is to assume that
important structure in the data actually lies in a much
smaller dimensional space, and therefore to reduce
the dimensionality before attempting the classification.

Hence, the desired property of a dimensionality
reduction/feature extraction method is to lose as little
information as possible after the transformation
from the high dimensional space to the low dimensional
one. This motivation underlies methods such as
principal components (PC), mutual information
maximization (Linsker, 1986), and a self-supervised form of
back-propagation.

At first glance, it seems that a supervised feature
extraction method will always be superior to an
unsupervised one, because if one has more information about
the  problem,  it  is  natural  to  suppose  that  finding  the 
solution  is  easier.  However,  unsupervised  methods  use 
a  local  measure  to  optimally  estimate  single  dimen¬ 
sional  functions  of  projections  instead  of  functions  of 
the  full  dimensionality  of  the  space,  and  therefore  tend 
to  be  less  sensitive  to  the  curse  of  dimensionality  prob¬ 
lem  (Huber,  1985). 

One  way  to  reduce  the  curse  of  dimensionality  is  to 
look  for  lower  dimensional  structures  (features)  by  us¬ 
ing  a  localized  and  smooth  objective  function  that  di¬ 
rectly  measures  the  importance  of  the  extracted  fea¬ 
tures. 

A useful class of features to explore is defined by some
linear  projections  of  the  high  dimensional  data.  This 
class  is  used  in  projection  pursuit  methods  (PP)  orig¬ 
inally  introduced  by  Kruskal  (1969,  1972),  Switzer 
(1970,  1971),  and  later  implemented  by  Friedman  and 
Tukey  (1974).  These  methods  are  reviewed  in  Huber 
(1985). 

It is still difficult to characterize what interesting
projections are, although it is easy to point at
projections that are uninteresting. To motivate this
discussion, consider the following example in which two data
clusters lie in a two dimensional space. If we are
interested in reducing the dimensionality of the data while
still retaining an indication of its structure, it is best
to project the data onto the x axis, even though the
variance of the projection onto the y axis is larger.


Figure 1: In this dimensionality reduction
problem the interesting direction is not the
one that maximizes the variance: two data
clusters which can be separated by projecting
onto the x axis cannot be separated by projecting
onto the y axis, although the variance along the
y axis is larger.
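
The situation in Figure 1 can be reproduced numerically. The following is a minimal sketch on synthetic data; the cluster centers, spreads, sample size, and the helper `threshold_accuracy` are illustrative assumptions and are not taken from the experiments in this paper. It shows that the first principal component follows the high-variance y axis, while only the x projection separates the two clusters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two clusters separated along x (centers at -2 and +2, small spread),
# with a large common spread along y. All numbers are illustrative assumptions.
labels = rng.integers(0, 2, size=n)
x = np.where(labels == 0, -2.0, 2.0) + 0.3 * rng.standard_normal(n)
y = 5.0 * rng.standard_normal(n)
data = np.column_stack([x, y])

var_x, var_y = data.var(axis=0)

# First principal component: eigenvector of the covariance matrix
# with the largest eigenvalue (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, -1]

def threshold_accuracy(projection, labels):
    """Accuracy of the best of the two sign rules on a 1-D projection (hypothetical helper)."""
    acc = ((projection > 0).astype(int) == labels).mean()
    return max(acc, 1.0 - acc)

acc_x = threshold_accuracy(data[:, 0], labels)
acc_y = threshold_accuracy(data[:, 1], labels)
print("variance along x, y:", var_x, var_y)
print("first principal component:", pc1)
print("cluster separation accuracy using x, y:", acc_x, acc_y)
```

In this sketch the variance criterion selects the y direction, which carries no information about cluster membership.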

Notice that in the above example, the projection onto
the x axis will give a two hump distribution, while
the projection onto the y axis will give a normal
distribution. It turns out that this is not a coincidence.
A statement that has recently been made precise by
Diaconis and Freedman (1984) says that for most high
dimensional clouds, most low-dimensional projections
are approximately normal. This finding suggests that
the important information in the data is conveyed in
those directions whose single dimensional projected
distribution is far from Gaussian. Friedman (1987)
argues  that  the  most  computationally  attractive  mea¬ 
sures  for  deviation  from  normality  (projection  indices) 
are  based  on  polynomial  moments.  For  example,  prin¬ 
cipal  components  extraction  uses  a  projection  index 
which  is  based  on  polynomials  of  the  second  moment 
of  the  projections  (maximizing  the  projected  variance). 
In some special cases where the data is known in
advance to be bi-modal, it is relatively straightforward
to define a good projection index (Hinton & Nowlan,
1990).

Despite their computational attractiveness, projection
indices based on polynomial moments are not directly
applicable, since they very heavily emphasize departure
from normality in the tails of the distribution
(Huber, 1985). Friedman (1987) addresses this issue by
introducing a nonlinear transformation that squashes
the projected data from $R$ to $[-1,1]$ using a normal
distribution function. We address the problem by
applying a sigmoidal squashing function to the projections,
and then applying an objective function based
on polynomial moments.


2  Feature  Extraction  using  ANN 


In  this  section,  the  intuitive  idea  presented  above  is 
used  to  form  a  statistically  plausible  objective  function 
whose  minimization  will  find  those  projections  having 
a  single  dimensional  projected  distribution  that  is  far 
from  Gaussian. 

We first informally describe the statistical formulation
that leads to this objective function (the mathematical
details are left to the appendix). Based on statistical
decision theory, a neuron is considered as capable
of making decisions. The most intuitive decision for
a neuron is whether to fire or not for a given input
and vector of synaptic weights. To aid the neuron
in making the decision, a loss function is attached to
each decision, namely a function that measures the loss
from making each decision. The neuron's task is then
to choose the decision that minimizes the loss. Since
the loss function depends on the synaptic weight vector
in addition to the input vector, it is natural to
seek a synaptic weight vector that will minimize the
sum of the losses associated with every input, or more
precisely, the average loss (also called the risk). The
search for such a vector, which yields an optimal synaptic
weight vector under this formulation, can be viewed
as learning or parameter estimation. In those cases
where the risk is a smooth function, its minimization
can be done using gradient descent.

The ideas presented so far make no specific assumptions
regarding the loss function, and it is clear that
different loss functions will yield different learning
procedures. For example, if the loss function is related to
the inverse of the projection variance (including some
normalization), then minimizing the risk will yield
directions that maximize the variance of the projections,
i.e. will find the principal components.

Before presenting our version of the loss function, let
us review some necessary notation and assumptions.
Consider a neuron with input vector $x = (x_1, \ldots, x_N)$,
synaptic weight vector $m = (m_1, \ldots, m_N)$, both in
$R^N$, and activity (in the linear region) $c = x \cdot m$.
Define the threshold $\Theta_m = E[(x \cdot m)^2]$, and the functions
$\hat\phi(c, \Theta_m) = c^2 - \frac{2}{3} c \Theta_m$ and
$\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$. The
$\phi$ function has been suggested as a biologically plausible
synaptic modification function to explain visual
cortical plasticity (Bienenstock, Cooper and Munro,
1982). The main features of BCM theory will be discussed
below. $\Theta_m$ is a dynamic threshold which will
be shown later to have an effect on the sign of the
synaptic modification. The input $x$, which is a stochastic
process, is assumed to be of Type II $\varphi$ mixing¹,
bounded, and piecewise constant. These assumptions
are plausible, since they represent the closest continuous
approximation to the usual training algorithms, in
which training patterns are presented at random. The
$\varphi$ mixing property allows for some time dependency in
the presentation of the training patterns. These assumptions
are needed for the approximation of the resulting
deterministic gradient descent by a stochastic one
(Intrator, 1990b). For this reason we use a learning rate
$\mu$ that has to decay in time so that this approximation
is valid. Note that at this point $c$ represents the
linear projection of $x$ onto $m$, and we seek an optimal
projection in some sense.

Our projection index is aimed at finding directions
for which the projected distribution is far from Gaussian.
More specifically, we are interested in finding
clusters in high dimensional data. Since high dimensional
clusters have a multimodal projected distribution,
our aim is to find a projection index (loss
function) that emphasizes multimodality. For computational
efficiency, we would like to base the projection
index on polynomial moments of low degree. Using
second degree polynomials, one can get measures
of the mean and variance of the distribution, which
do not give information on multimodality; therefore,
higher order polynomials are necessary. Furthermore,
the projection index should reflect the fact that a bimodal
distribution is already interesting, and any additional
mode should make the distribution even more
interesting.

With this in mind, consider the following family of loss
functions, which depend on the synaptic weight vector
and on the input $x$ (the derivation based on decision
theory appears in the appendix):

$$L_m(x) = -\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\, ds$$

The motivation for this loss function can be seen in
the following graph, which represents the $\phi$ function
and the associated loss function $L_m(x)$. For simplicity,
the loss for a fixed threshold $\Theta_m$ and synaptic vector
$m$ can be written as $L_m(c) = -\frac{\mu}{3} c^2 (c - \Theta_m)$, where
$c = (x \cdot m)$.


¹ The $\varphi$ mixing property specifies the dependency of the
future of the process on its past.


Figure 2: The function $\phi$ and the loss functions
for a fixed $m$ and $\Theta_m$.

The graph of the loss function shows that for any
fixed $m$ and $\Theta_m$, the loss is small for a given input
$x$ when either $c = x \cdot m$ is close to zero, or when
$x \cdot m$ is larger than $\frac{4}{3}\Theta_m$. Moreover, the loss function
remains negative for $(x \cdot m) > \frac{4}{3}\Theta_m$; therefore any
kind of distribution to the right of $\frac{4}{3}\Theta_m$ is
possible, and the preferred ones are those which are
concentrated further from $\frac{4}{3}\Theta_m$.

It remains to show why it is not possible that a minimizer
of the average loss will be such that all the mass
of the distribution will be concentrated in one of the
regions. Roughly speaking, this cannot happen because
the threshold $\Theta_m$ is dynamic and depends on the
projections in a nonlinear way, namely, $\Theta_m = E[(x \cdot m)^2]$.
This implies that $\Theta_m$ will always move itself to a
position such that the distribution will never be concentrated
on only one of its sides. It follows that the part
of the distribution for $c < \frac{4}{3}\Theta_m$ has high loss, making
those distributions in which the part for
$c < \frac{4}{3}\Theta_m$ has its mode at zero more plausible.

The fact that the distribution has part of its mass on
both sides of $\frac{4}{3}\Theta_m$ makes it already a plausible projection
index that seeks multimodalities. However, this
projection index will be more general if, in addition,
the loss is insensitive to outliers, and if we allow any
projected distribution to be shifted so that the part of
the distribution that satisfies $c < \frac{4}{3}\Theta_m$ will have its
mode at zero. These points will be discussed below.

The risk (expected value of the loss) is given by:

$$R_m = -\frac{\mu}{3}\left\{E[(x \cdot m)^3] - E^2[(x \cdot m)^2]\right\}.$$

Since the risk is continuously differentiable, its minimization
can be achieved via a gradient descent
method with respect to $m$, namely:

$$\frac{dm_i}{dt} = -\frac{\partial}{\partial m_i} R_m = \mu\left\{E[(x \cdot m)^2 x_i] - \frac{4}{3} E[(x \cdot m)^2]\, E[(x \cdot m) x_i]\right\} = \mu E[\phi(x \cdot m, \Theta_m)\, x_i].$$
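
The relation between the risk and the modification equation can be checked numerically. The sketch below is an illustration only: it replaces expectations by sample means over a synthetic data matrix `X`, takes $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$, and compares the resulting expression with a finite-difference estimate of $-\partial R_m / \partial m_i$. The data, the value of `mu`, and all function names are assumptions made for this example.

```python
import numpy as np

def risk(m, X, mu=1.0):
    """Empirical risk R_m = -(mu/3) * (mean(c^3) - mean(c^2)^2), with c = x . m."""
    c = X @ m
    return -(mu / 3.0) * (np.mean(c**3) - np.mean(c**2) ** 2)

def neg_gradient(m, X, mu=1.0):
    """mu * mean(phi(c, theta) * x), with phi(c, theta) = c^2 - (4/3) c theta."""
    c = X @ m
    theta = np.mean(c**2)                      # sample estimate of Theta_m
    phi = c**2 - (4.0 / 3.0) * c * theta
    return mu * (phi[:, None] * X).mean(axis=0)

def numerical_neg_gradient(m, X, mu=1.0, eps=1e-6):
    """Central finite-difference estimate of -dR/dm, used only as a check."""
    g = np.zeros_like(m)
    for i in range(m.size):
        step = np.zeros_like(m)
        step[i] = eps
        g[i] = -(risk(m + step, X, mu) - risk(m - step, X, mu)) / (2.0 * eps)
    return g

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4)) + 0.5        # synthetic inputs (assumption)
m = 0.1 * rng.standard_normal(4)

analytic = neg_gradient(m, X)
numeric = numerical_neg_gradient(m, X)
print("analytic:", analytic)
print("numeric: ", numeric)
```

A deterministic gradient descent step would then be `m = m + dt * neg_gradient(m, X)` for a small, hypothetical step size `dt`.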

The resulting differential equations suggest a modified
version of the law governing synaptic weight modification
in the BCM theory for learning and memory
(Bienenstock, Cooper and Munro, 1982). This theory was
presented to account for various experimental results
in visual cortical plasticity. According to this theory,
the synaptic efficacy of active inputs increases when
the postsynaptic target is concurrently depolarized beyond
a modification threshold, $\Theta_m$. However, when the
level of postsynaptic activity falls below $\Theta_m$, the
strength of active synapses decreases. An important
feature of this theory is that the value of the modification
threshold is not fixed, but instead varies as a nonlinear
function of the average output of the postsynaptic
neuron. This feature provides the stability properties
of the model, for positive or mean positive inputs,
and is necessary in order to explain, for example, why
the low level of postsynaptic activity caused by binocular
deprivation does not drive the strengths of all cortical
synapses to zero. Mean field theory for a network
based on these neurons is presented in (Scofield and
Cooper, 1985; Cooper and Scofield, 1988), statistical
analysis is given in Intrator (1990c), and computer simulations
and biological relevance are discussed in (Saul et
al., 1986; Bear et al., 1987; Cooper et al., 1987; Bear
et al., 1988; Clothiaux, 1990).

Up to this point we have presented an unsupervised
(exploratory) method for feature extraction that seeks
projections in which the single dimensional distribution
is multimodal; namely, we have presented an exploratory
projection pursuit method. This method
uses polynomial moments as a projection index and
therefore suffers from over-sensitivity to outliers
(Friedman, 1987). We address this problem by considering
a nonlinear neuron in which the neuron's activity is defined
to be $c = \sigma(x \cdot m)$, where $\sigma$ usually represents a
smooth sigmoidal function. A more general definition,
which allows symmetry breaking of the projected
distributions, provides a solution to the second problem
raised above, and is still consistent with the statistical
formulation, is $c = \sigma(x \cdot m - \alpha)$, for an arbitrary
threshold $\alpha$ which can be found by using gradient descent
as well. For the nonlinear neuron, $\Theta_m$ is defined
to be $\Theta_m = E[\sigma^2(x \cdot m)]$. The loss function is given
by:

$$L_m(x) = -\mu \int_0^{\sigma(x \cdot m)} \hat\phi(s, \Theta_m)\, ds = -\frac{\mu}{3}\left\{\sigma^3(x \cdot m) - E[\sigma^2(x \cdot m)]\, \sigma^2(x \cdot m)\right\}$$

The gradient of the risk becomes:

$$-\frac{\partial}{\partial m_i} R_m = \mu\left\{E[\sigma^2(x \cdot m)\, \sigma' x_i] - \frac{4}{3} E[\sigma^2(x \cdot m)]\, E[\sigma(x \cdot m)\, \sigma' x_i]\right\} = \mu E[\phi(\sigma(x \cdot m), \Theta_m)\, \sigma' x_i],$$

where $\sigma'$ represents the derivative of $\sigma$ at the point
$(x \cdot m)$. Note that the multiplication by $\sigma'$ reduces the
sensitivity of the differential equation to outliers, since
for outliers $\sigma'$ is close to zero.
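
A stochastic version of this update for a single nonlinear neuron might look as follows. This is a hedged sketch rather than the procedure used in the experiments: the logistic sigmoid, the running-average estimate of $\Theta_m$, the learning rate schedule `mu0 / (1 + 0.001 t)`, the synthetic inputs, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_prime(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def phi(c, theta):
    """phi(c, theta) = c^2 - (4/3) c theta."""
    return c**2 - (4.0 / 3.0) * c * theta

def train_nonlinear_neuron(X, n_epochs=5, mu0=0.05, seed=0):
    """Online update m += mu * phi(c, theta) * sigma'(x.m) * x with a decaying mu."""
    rng = np.random.default_rng(seed)
    m = 0.1 * rng.standard_normal(X.shape[1])
    theta = 0.0                        # running estimate of E[sigma^2(x.m)] (assumption)
    t = 0
    for _ in range(n_epochs):
        for idx in rng.permutation(len(X)):   # patterns presented at random
            t += 1
            mu = mu0 / (1.0 + 0.001 * t)      # hypothetical decay schedule
            x = X[idx]
            u = x @ m
            c = sigmoid(u)
            theta += (c**2 - theta) / t       # running average of c^2
            m += mu * phi(c, theta) * sigmoid_prime(u) * x
    return m, theta

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))             # synthetic inputs (assumption)
m, theta = train_nonlinear_neuron(X)
print("weights:", m, "threshold estimate:", theta)

# The factor sigma' is what damps the contribution of outliers:
print("sigma'(0) =", sigmoid_prime(0.0), " sigma'(20) =", sigmoid_prime(20.0))
```

For a projection value far from zero the factor `sigmoid_prime(u)` is nearly zero, so such an input barely changes the weights.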

Based on this formulation, a network of Q identical
nodes, which receive the same input and inhibit each
other, may be constructed in order to extract several
features at once. A similar network has been studied
by Scofield and Cooper (1985). The activity of neuron
$k$ in the network is defined as $c_k = x \cdot m_k$, where $m_k$ is
the synaptic weight vector of neuron $k$. The inhibited
activity and threshold of the $k$'th neuron are given by

$$\tilde c_k = c_k - \eta \sum_{j \neq k} c_j, \qquad \tilde\Theta_m^k = E[\tilde c_k^2].$$

A schematic structure of the network is given in Figure
3.


Figure 3: The activity of a nonlinear neuron
$j$ is given by $c_j = \sigma(x \cdot m_j)$; the inhibited
activity is given by $\tilde c_j = c_j - \eta \sum_{i \neq j} c_i$.
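
The inhibited activities and thresholds of such a network can be computed for all neurons at once. The sketch below assumes the linear activity $c_k = x \cdot m_k$ and reads the inhibition as subtracting a fraction `eta` of the summed activity of the other neurons; the value of `eta`, the toy inputs, and the function name are hypothetical.

```python
import numpy as np

def inhibited_activities(X, M, eta):
    """Return inhibited activities and thresholds for Q neurons.

    X: (n_samples, N) inputs; M: (Q, N) weight vectors, one row per neuron.
    """
    C = X @ M.T                                  # C[i, k] = x_i . m_k
    total = C.sum(axis=1, keepdims=True)
    C_tilde = C - eta * (total - C)              # subtract eta * sum over j != k
    theta_tilde = (C_tilde**2).mean(axis=0)      # sample estimate of E[c_tilde_k^2]
    return C_tilde, theta_tilde

# Toy example with two inputs and three neurons (all values are assumptions).
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
C_tilde, theta_tilde = inhibited_activities(X, M, eta=0.5)
print(C_tilde)
print(theta_tilde)
```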


We omit the derivation of the synaptic modification
equations, which is similar to the one for a single neuron,
and present only the resulting modification equations
for a synaptic vector $m_k$ in a lateral inhibition
network of nonlinear neurons:

rhk  =  -p  £'{0(c)fc,©^)((T'(x  •  m*) 


4 


7^* 


The full derivation can be found in Intrator (1990a).
The lateral inhibition network performs a direct search
of k-dimensional projections together, which may find
a richer structure that a stepwise approach may miss;
see, e.g., example 14.1 in Huber (1985).


3  Comparison  with  other  feature 
extraction  methods 


The problem of feature extraction for classification is
in some sense easier than that of feature extraction
for density or function estimation. This is because the
only interesting features in such a case are those that
distinguish between a finite set of classes. The common
features, namely those features that do not help in
making the distinction between classes, are uninteresting,
even though they may be very important for data
compression, e.g. in the self supervised back-propagation
network in which the number of hidden units is smaller
than the number of input and output units (Elman &
Zipser, 1989). The network presented in the previous
sections has been shown to seek multimodality in the
projected distributions, which translates to clusters in
the original space, and therefore to find those directions
that make a distinction between different sets in
the training data.

In  this  section  we  explore  the  differences  in  clas¬ 
sification  performance  between  a  network  that  per¬ 
forms  dimensionality  reduction  (before  the  classifica¬ 
tion)  based  upon  distinguishing  features,  and  a  net¬ 
work  that  performs  dimensionality  reduction  based 
upon minimization of misclassification error. The
performance of the different methods will be compared
on a specific classification task: a phoneme classification
experiment whose linguistic motivation is described
below.

We looked at the six stop consonants [p,k,t,b,g,d],
which have been a subject of recent research in evaluating
neural networks for phoneme recognition (see
review in Lippmann, 1989). These stops possess several
common features, but only two distinguishing phonetic
features, place of articulation and voicing (Table 1) (see
Blumstein & Lieberman, 1984, for a review and related
references on phonetic feature theory).


                 Place of Articulation
             Velar     Alveolar    Labial
Voiced       [g]       [d]         [b]
Unvoiced     [k]       [t]         [p]

Table  1:  The  two  distinguishing  phonetic  fea¬ 
tures  between  the  six  stop  consonants. 


The linguistic information in the table suggests the
following experiment: A network is to be trained to
reduce dimensionality from the unvoiced stops [p,k,t].
In order to reduce variability in the data, only a single
speaker and a single vowel context are used. Therefore,
the  only  distinguishing  features  in  the  training  data 
are  associated  with  place  of  articulation,  since  the  fea¬ 
tures  that  are  speaker  dependent,  voicing  dependent, 
or  context  dependent  belong  to  the  set  of  common  fea¬ 
tures  in  the  training  data.  A  dimensionality  reduc¬ 
tion  method  that  concentrates  mainly  on  distinguish¬ 
ing  features  should  find  only  the  features  associated 
with  place  of  articulation,  and  therefore  become  in¬ 
sensitive  to  voicing  dependent  and  speaker  dependent 
features,  which  are  the  common  features  in  the  train¬ 
ing  data.  This  can  easily  be  tested  by  evaluating  the 
performance  on  place  of  articulation  classification  of 
voiced  stops  and  data  from  other  speakers. 

For comparison, we have attempted to extract features
using three methods: principal components,
back-propagation, and the above unsupervised network,
all trained and tested on the same data. In
back-propagation, the only supervised method, the place of
articulation phonetic feature was used as a supervisor.


Figure 4: The six stop consonants followed by
the vowel [a] for male speaker BSS. Their order
from bottom to top is [pa] [ka] [ta] [ba] [ga]
[da]. Each token is represented by 20 consecutive
time windows of 32 msec with 30 msec
overlap. In each time frame a set of 22 energy
levels in Zwicker critical band filters is
computed. Notice the significant difference
between the voiced and the unvoiced images.


The speech data consists of 20 consecutive time windows
of 32 msec with 30 msec overlap, aligned to the
beginning  of  the  burst.  In  each  time  window,  a  set 
of  22  energy  levels  is  computed.  These  energy  levels 
correspond  to  Zwicker  critical  band  filters  (Zwicker, 
1961). 
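
The framing arithmetic implied by this description can be made explicit. The short sketch below assumes that consecutive windows are shifted by the window length minus the overlap and that the first window starts at the burst onset (time zero); the variable names are illustrative.

```python
WINDOW_MS = 32      # window length
OVERLAP_MS = 30     # overlap between consecutive windows
N_WINDOWS = 20      # windows per token
N_BANDS = 22        # energy levels (critical band filters) per window

hop_ms = WINDOW_MS - OVERLAP_MS                          # shift between windows
window_starts_ms = [i * hop_ms for i in range(N_WINDOWS)]
total_span_ms = window_starts_ms[-1] + WINDOW_MS         # signal covered by one token
feature_dim = N_WINDOWS * N_BANDS                        # length of the input vector

print(hop_ms, total_span_ms, feature_dim)
```

Under these assumptions each token covers 70 msec of signal and is represented by a 440 dimensional vector.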

The consonant-vowel (CV) pairs were pronounced in
isolation by native American speakers (two male, BSS
and LTN, and one female, JES). Five tokens of each of
the CV pairs used for training are presented in Figure
4. Additional details on biological motivation for
the preprocessing, and linguistic motivation related to
child language acquisition, can be found in Seebach
(1990), and Seebach and Intrator (1990).


Figure 5: The six stop consonants followed
by the vowel [a] for female speaker JES. Their
order from bottom to top is [pa] [ka] [ta] [ba]
[ga] [da]. Pre-processing is the same as above.
Notice that the same burst that appears in [ta]
is clear in the [da] as well.

Figure 5 presents five tokens of each of the CV pairs
pronounced by the female speaker JES. The classification
results obtained on this speaker using the BCM network
and the principal components method were better
than those obtained when testing on the speaker
that was used in the training. This is
due to the very 'clean' sound that corresponds closely
to the acoustic features that are known (Blumstein &
Lieberman, 1984) to exist in these sounds. For example,
this was the only speaker out of several that we
tested in which the high frequency burst (top left corner)
is as clear for the voiced stops as it is for the
unvoiced stops.

The unsupervised feature extraction/classification
method is presented in Figure 6. Similar approaches using
the RCE and back-propagation networks have been
carried out by several researchers (Rimey et al., 1986;
Reilly et al., 1987, 1988; Zemani et al., 1989), and
using the unsupervised charge clustering network by
Scofield (1988).

Five  features/directions  were  extracted  from  the  440 
dimensional  preprocessed  speech  vectors.  These  fea¬ 
tures  were  the  activation  of  five  neurons  in  the  unsu¬ 
pervised  network,  the  five  principal  components  in  the 
PC  method,  and  the  five  hidden  unit  activations  in 
back-propagation.  The  extracted  features  were  used 
to train a k-NN classifier (with k = 3) to classify place
of  articulation.  Although  the  three  dimensionality  re¬ 
duction  methods  were  trained  only  with  the  unvoiced 
tokens  of  a  single  speaker,  the  five  dimensional  k-NN 
classifier  was  trained  on  voiced  and  unvoiced  data  from 
the  other  speakers  as  well. 
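
The classification stage can be illustrated with a minimal k-NN sketch in NumPy. The five dimensional feature vectors below are synthetic stand-ins for the extracted features, the three integer labels stand for the three places of articulation, and the function name `knn_predict` is hypothetical; none of this reproduces the actual experimental data.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d2 = ((query_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest points
    votes = train_y[nearest]                     # their labels, shape (n_query, k)
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
# Three synthetic classes in a 5-dimensional feature space (assumption).
centers = np.array([[0.0] * 5, [5.0] * 5, [10.0] * 5])
train_y = np.repeat(np.arange(3), 10)
train_X = centers[train_y] + 0.5 * rng.standard_normal((30, 5))
query_y = np.array([0, 1, 2])
query_X = centers[query_y] + 0.5 * rng.standard_normal((3, 5))

pred = knn_predict(train_X, train_y, query_X, k=3)
print("predicted classes:", pred)
```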


[Figure 6 diagram: classification using feature extraction network]

Figure 6: A low dimensional k-NN classifier is
trained on the features extracted from the
high dimensional data. Training of the feature
extraction network stops when the misclassification
rate drops below a predetermined threshold
on either the same training data (cross validatory
test) or on different testing data.

The classification results are summarized in Table 2.
Several observations can be made from the results.
First, the principal components dimensionality reduction
is clearly not sufficient for discovering structure
in this kind of data, suggesting that the structure is
highly nonlinear. Second, the back-propagation network
is doing well in finding structure useful for classification
of the trained data, but this structure does
not concentrate solely on distinctive features; it also
contains  speaker  dependent  and  voicing  dependent  fea¬ 
tures,  and  therefore  has  degraded  classification  perfor¬ 
mance  when  tested  on  voiced  data,  or  data  from  other 
speakers.  This  can  also  be  viewed  as  a  generalization 
problem,  in  which  case  one  can  say  that  the  network 
is  overfitting  to  the  training  data.  Third,  classification 
results  using  the  BCM  network  for  dimensionality  re¬ 
duction  suggest  that  for  this  specific  task,  structure 
that  is  less  sensitive  to  voicing  features  can  be  ex¬ 
tracted,  even  though  the  network  was  trained  on  the 
unvoiced  data  only  and  voicing  has  significant  effects 
on  the  speech  signal  itself. 


         Place of Articulation Classification
                  P-C      B-P      BCM
BSS /p,k,t/       66.0     100.0    98.6
BSS /b,g,d/       57.4     73.3     94.0
LTN /p,k,t/       60.0     95.8     98.9
LTN /b,g,d/       46.6     66.7     90.0
JES (Both)        70.6     83.7     99.4

Table 2: Percentage of correct classification
of place of articulation in voiced and
unvoiced stops using principal components,
back-propagation, and the BCM network.
Training for dimensionality reduction was
done on unvoiced stops of male speaker BSS in
all three experiments. LTN is a male speaker
as well. The result in the last row represents
testing on both the voiced and unvoiced
stops of a female speaker (JES). The results
represent an average of several trials,
which differ only in the initial conditions of
the networks.

4  Discussion 

It has been shown that the BCM neuron is capable
of effectively discovering nonlinear structures in high
dimensional spaces. When compared with other projection
indices, the highlights of the presented method
are i) the projection index concentrates on directions
where the separability property as well as the
non-normality of the data is large, thus giving rise to better
classification properties; ii) the degree of correlation
between the directions (features) extracted by the
network can be regulated via the global inhibition,
allowing some tuning of the network to different types of
data for optimal results; iii) the pursuit is done on all
the directions at once, thus leading to the capability of
finding more interesting structures than methods that
find only one projection direction at a time.

Regarding  the  speech  experiment,  the  network  and 
its training paradigm present a different approach to
speaker  independent  speech  recognition.  In  this  ap¬ 
proach  the  speaker  variability  problem  is  addressed  by 
training  a  network  that  concentrates  mainly  on  the  dis¬ 
tinguishing  features,  on  a  single  speaker,  as  opposed 
to  training  a  network  that  concentrates  on  both  the 
distinguishing  and  common  features,  on  multi-speaker 
data. 

Mathematical Appendix

In  this  section  we  develop  the  statistical  formulation 
that  yields  the  loss  function  presented  in  section  2. 

Let $(\Omega, \mathcal{F}, P)$ be a probability space on the space of
inputs $\Omega$ with probability law $P$. Let $A = \{0, 1\}$ be
a decision space; in the case of a single neuron, a zero
decision means that the neuron does not fire. Let $m$
be a vector of parameters such as the one described
above, and assume that it lies in a compact space $B^M \subset R^N$.
This parameter space defines a family of loss functions
$\{L_m\}_{m \in B^M}$, $L_m : \Omega \times A \to R$. Let $\mathcal{D}$ be the space
of all decision rules. The empirical risk (average loss)
$R_m : \mathcal{D} \to R$ is given by:

$$R_m(\delta) = \sum_{i=1}^{n} P(x^{(i)})\, L_m(x^{(i)}, \delta(x^{(i)})).$$

For a fixed $m$, the optimal decision $\delta_m$ is chosen so
that:

$$R_m(\delta_m) = \min_{\delta \in \mathcal{D}} \{R_m(\delta)\}.$$

Since this minimization takes place over a finite set,
the minimizer exists. In particular, for a given $x^{(i)}$ the
decision $\delta_m(x^{(i)})$ is chosen so that

$$L_m(x^{(i)}, \delta_m(x^{(i)})) < L_m(x^{(i)}, 1 - \delta_m(x^{(i)})).$$

At this point $R_m(\delta_m)$ is a risk function that depends
only on the vector of parameters $m$, and assuming $L_m$
is bounded, it is natural to seek a parameter $\hat m$ that
minimizes $R_m$, namely,

$$R_{\hat m}(\delta_{\hat m}) = \min_{m \in B^M} \{R_m(\delta_m)\}.$$

The minimum with respect to $m$ exists since $B^M$ is
compact, and $R_m$ is bounded. When $m$ represents a
vector in $R^N$, $R_m$ can be viewed as a projection index.

Based on the above, let $m$, the synaptic weight vector,
be the parameter to be estimated, and consider the
following family of loss functions. The loss functions
depend on the cell's decision whether to fire or not, and
they represent the intuitive idea that the neuron will
fire when its activity is greater than some threshold,
and will not otherwise. We denote the firing of the

neuron  by  a  =  1.  Define  K  =  -p  i(s,Qr„)ds. 

The loss function for a decision to fire is given by:


A’  -  p 0(s,  ©m  )ds,  (j'  ni)<0„ 


and for the decision not to fire by:


(j  ■  m)  <  ©„ 


[  A' - /j  C)(s,  0„  )ds.  (x  m)  >  0ro 

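It follows from the definition of $L_m$ and from the definition
of $\delta_m$ that

$$L_m(x, \delta_m) = -\mu \int_0^{(x \cdot m)} \hat\phi(s, \Theta_m)\, ds = -\frac{\mu}{3}\left\{(x \cdot m)^3 - E[(x \cdot m)^2]\,(x \cdot m)^2\right\}.$$

We can write $L_m(x)$ instead of $L_m(x, \delta_m)$ when there
is no confusion.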

The risk is given by:

$$R_m(\delta_m) = -\frac{\mu}{3}\left\{E[(x \cdot m)^3] - E^2[(x \cdot m)^2]\right\}.$$

Since the risk is continuously differentiable, its minimization
can be done via the gradient descent method
with respect to $m$, namely:

$$\frac{dm_i}{dt} = -\frac{\partial}{\partial m_i} R_m(\delta_m) = \mu E[\phi(x \cdot m, \Theta_m)\, x_i].$$
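
The step from the risk to this modification equation can be spelled out term by term. The following worked derivation is a sketch that assumes differentiation and expectation may be interchanged (which is plausible for the bounded inputs assumed in Section 2), and uses $c = x \cdot m$, $\Theta_m = E[c^2]$ and $\phi(c, \Theta_m) = c^2 - \frac{4}{3} c \Theta_m$.

```latex
\begin{align*}
\frac{\partial}{\partial m_i} E[c^3] &= 3\, E[c^2 x_i], \\
\frac{\partial}{\partial m_i} E^2[c^2] &= 2\, E[c^2] \cdot 2\, E[c\, x_i]
  = 4\, \Theta_m\, E[c\, x_i], \\
-\frac{\partial}{\partial m_i} R_m
  &= \frac{\mu}{3} \left\{ 3\, E[c^2 x_i] - 4\, \Theta_m\, E[c\, x_i] \right\} \\
  &= \mu\, E\!\left[ \left( c^2 - \tfrac{4}{3}\, c\, \Theta_m \right) x_i \right]
  = \mu\, E[\phi(c, \Theta_m)\, x_i].
\end{align*}
```

The factor $\frac{4}{3}$ in $\phi$ thus arises from differentiating the squared second moment $E^2[c^2]$, in which $m$ enters through both factors.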


Notice that the resulting equation represents an averaged
deterministic equation of the stochastic BCM
modification equations. It turns out that under suitable
conditions on the mixing of the input $x$ and the
global function $\mu$, this equation is a good approximation
of its stochastic version (Intrator, 1990b), namely:

$$\frac{dm_i}{dt} = \mu\, \phi(x \cdot m, \Theta_m)\, x_i.$$